Molecular Systems Biology
Springer Science and Business Media LLC
Preprints posted in the last 7 days, ranked by how well they match Molecular Systems Biology's content profile, based on 142 papers previously published here. The average preprint has a 0.06% match score for this journal, so anything above that is already an above-average fit.
Hu, S.; Cheng, H.; Gillenwater, L.; Manpearl, K.; Mandava, A.; Wang, Y.; Pividori, M.; Stranger, B.; Krishnan, A.; Greene, C.; Gao, Y.
Objective. Biomedical knowledge graphs (KGs) such as PrimeKG, Hetionet, UMLS, and PharmGKB are increasingly used as the substrate for downstream machine-learning, retrieval-augmented generation, drug-repurposing, and electronic health record (EHR) augmentation pipelines. The dominant assumption in published work is that integrating two or more such KGs is a tractable engineering step solved by identifier (ID) matching. This paper interrogates that assumption empirically. We quantify how much concept overlap survives realistic alignment, and we characterize the new failure modes introduced by the methods that practitioners reach for when ID matching is insufficient. Materials and Methods. We compared four widely used biomedical KGs (PrimeKG, Hetionet v1.0, the full UMLS Metathesaurus, and PharmGKB) across eleven node types using a tiered alignment pipeline: (1) direct ID matching for nodes sharing a primary vocabulary; (2) cross-ontology bridging using standard mappings (e.g., MONDO-DOID, HPO-UMLS, HPO-UMLS-MeSH for side effects, NCBI Gene-HGNC-UMLS, UBERON-FMA/SNOMEDCT_US/NCI/MeSH for anatomy); (3) ClinicalBERT cosine-similarity grouping at threshold >= 0.98 for over-segmented disease nodes, with a deterministic suffix-stripping canonicalizer; (4) exact name matching for ontology-poor types (anatomy, REACTOME pathways); and (5) embedding-based fuzzy matching with UMLS lookup (SapBERT and ClinicalBERT) for free-text microbiome concepts. We applied the pipeline to a 698-concept gut-microbiome benchmark spanning taxa, pathways, and disease labels, validated grouping decisions against the curated SSSOM mappings released by the MONDO project, and audited the ClinicalBERT consolidation against five clinical-genetics case studies drawn from the literature. Results. Per-type pairwise coverage was strikingly asymmetric. 
Genes/proteins and the three Gene Ontology categories aligned cleanly across PrimeKG and Hetionet (mutual coverage 94-99%), but disease overlap was sparse: only 0.7% of PrimeKG individual disease nodes mapped to Hetionet, rising to 2.0% after MONDO grouping (versus 78.7% and 18.4% from the Hetionet side). PrimeKG-to-UMLS coverage spanned 100% (effect/phenotype via HPO) down to 20.8% (REACTOME pathways), with drugs at 73.7% and anatomy at 58.8%. PrimeKG-to-PharmGKB drug coverage required up to two bridging hops (DrugBank -> UMLS -> RxNorm/ATC/MeSH). Bigger was not uniformly more complete: on a 698-concept microbiome drug benchmark, Hetionet missed 0 concepts while PrimeKG missed 16. ClinicalBERT-based grouping consolidated 22,205 raw MONDO disease nodes into 17,080 groups but introduced three reproducible failure modes documented in case studies: (i) peer over-merging: for example, all 22 osteogenesis imperfecta subtypes collapsed into a single node despite distinct severity classes; (ii) parent-child collapse: e.g. acute myeloid leukemia merged with myeloid leukemia, erasing the acute/chronic distinction that drives clinical management; and (iii) lexical false positives: neurofibromatosis and schwannomatosis grouped together despite cellular-pathology differences. Discussion. Identifier matching alone is a weak baseline for biomedical KG integration. Cross-ontology bridges and embedding-based consolidation expand coverage but do so at the cost of clinically meaningful resolution, and the resulting failures are systematic rather than random. Reporting only aggregate coverage statistics obscures these losses, which propagate silently into downstream tasks. Conclusion. We provide reusable per-type coverage tables, a taxonomy of three integration failure modes, and concrete recommendations for downstream studies that depend on a unified biomedical KG. 
We argue that future KG integration work should report per-type coverage and per-cluster confidence rather than aggregate match rates.
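The asymmetry in directional coverage reported in this abstract follows from the differing denominators: the same set of shared concepts is a small share of a large graph and a large share of a small one. A minimal sketch, assuming hypothetical identifiers and a helper named `directional_coverage` (neither comes from the paper):

```python
def directional_coverage(source_ids, target_ids):
    """Return the fraction of source_ids that also appear in target_ids."""
    source = set(source_ids)
    if not source:
        return 0.0
    return len(source & set(target_ids)) / len(source)


# Hypothetical toy identifiers, not taken from any real knowledge graph.
large_kg = {f"D{i}" for i in range(1, 101)}   # 100 disease nodes: D1..D100
small_kg = {"D1", "D2", "D3", "D4", "X1"}     # 5 nodes, 4 of them shared

large_to_small = directional_coverage(large_kg, small_kg)  # 4 of 100 nodes
small_to_large = directional_coverage(small_kg, large_kg)  # 4 of 5 nodes

print(f"large -> small: {large_to_small:.2%}")
print(f"small -> large: {small_to_large:.2%}")
```

Reporting both directions per node type, rather than a single aggregate match rate, is what makes this kind of gap visible.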
Zhou, G.; Williams, G.; Millner, M. T.; AlHirayban, R.; Alosaimi, W.; Fallatah, O.; Hart, A. J.; Malaikah, M.; Iftikhar, S.; Ahmad, H.; Roghanian, M.; Mustonen, V.; AlYami, R.; Banzhaf, M.; Moradigaravand, D.
Background Bacterial fitness is shaped by interactions between genome variation and environmental context, yet how these interactions determine its predictability and heritability remains unclear. In the clinically important pathogen Klebsiella pneumoniae, a leading cause of hospital-acquired infections, this question is particularly pressing. Despite extensive genomic characterization, we still lack a systematic understanding of how genome-wide variation translates into fitness across diverse environments in K. pneumoniae. Methods We filled this gap by profiling a systematic collection of 1,462 clinical K. pneumoniae isolates across 214 diverse environmental and pharmacological stress conditions using high-throughput chemical genomics. Fitness was quantified from colony growth and integrated with whole-genome sequencing data. Genome-wide association analyses identified genetic determinants of fitness, and machine learning models incorporating genomic features were used to predict fitness. Results Fitness exhibited a strongly environment-dependent genetic architecture, with modest but significant concordance between genetic background and phenotypic variation. Under antibiotic and stress-combination conditions, fitness was driven by discrete, high-effect determinants, including known resistance genes, resulting in stronger signals and improved predictability. In contrast, non-antibiotic environments showed more polygenic and distributed architectures with weaker associations. Genome-wide analyses identified both established and previously uncharacterized genes linked with fitness across conditions. Resistance and virulence determinants exhibited clear context-dependent trade-offs, conferring fitness advantages under selection but imposing costs in non-selective environments. Consistent with this, plasmid carriage showed environment- and genotype-dependent fitness effects, with benefits under antibiotic pressure and measurable costs otherwise.
Genomic variant-based models for fitness prediction achieved moderate performance (mean Spearman correlation ρ = 0.36, 95% CI: 0.18-0.67, for predicted versus observed values in unseen data) across conditions, with improved accuracy under strong antibiotic selective pressures, and produced well-calibrated prediction intervals with high coverage. Despite a strong effect of population structure on predictions, models captured predictive gene and SNP biomarkers for fitness. Conclusion These findings highlight that bacterial fitness is an emergent property of genome-environment interactions rather than a fixed attribute of genotype. This work establishes a unified high-dimensional genotype-phenotype framework linking genomic variation to fitness across diverse conditions in a major pathogen, with broader implications for other pathogenic bacterial species.
Borovoi, L.; Kahalon, R.; Edelstein, M.
Research on under-vaccination often segments populations using demographic or administrative variables that are operationally useful but fail to capture identity dimensions relevant to vaccination decisions. Drawing on social identity theory, we propose an identity-landscape approach distinguishing identity membership, identity centrality, and multidimensional identity structure. Using a cross-sectional survey of 1,000 UK parents, we measured 65 identity indicators, identity-importance ratings, and their association with attitudinal and behavioural hesitancy toward childhood vaccination using validated scales. Beyond established socio-demographic predictors, alternative-medicine and natural-lifestyle identities, as well as affiliation with social media networks, were linked to greater hesitancy. Greater centrality of religion and political affiliation within personal identity was also associated with higher hesitancy. Principal component analysis suggested that individuals actively engaged across multiple societal issues were more hesitant, whereas stereotypically male-gendered engagement was associated with lower hesitancy. An identity-focused population segmentation may identify previously unrecognized undervaccinated groups and inform innovative tailored immunization campaigns.
Zhang, C.; Chen, Y.-L.; Jamilov, A.; Liu, E.; Shree, S.; Lam, B. D.; Foy, B. H.
Most routine clinical markers are interpreted using population-based reference intervals, despite being regulated around patient-specific homeostatic setpoints. This mismatch obscures physiologic shifts, inhibiting detection of early disease signatures. Here, we develop a novel Bayesian inference method that adaptively constructs personalized reference intervals using each patient's existing health records. In an analysis of >100 million lab tests in >800,000 patients, these personalized intervals can be accurately constructed with only minimal prior data, meaning this method can be applied near universally. We show that across 43 common lab markers, patient setpoints are strongly associated with future morbidity, with signal strength increasing as more test data are collected. Deviation from personalized reference intervals provides strong and novel risk signatures across diverse disease states, including hypothyroidism, hematologic cancers, kidney disease, and pregnancy complications. Importantly, personalized reference intervals capture a different risk signature from existing population-based approaches, with the highest-risk patients being those who deviate from both intervals simultaneously. In a targeted clinical use case study of iron infusion, use of personalized reference intervals greatly improved prediction of treatment efficacy and allowed precise tracking of treatment responses. Our results illustrate how existing health records can be used to construct personalized benchmarks for nearly all common clinical tests, driving a new paradigm for precision laboratory medicine.
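The abstract does not describe the inference method itself. As a rough illustration of the general idea only, the sketch below uses a textbook normal-normal conjugate update, which is an assumption and not the authors' model: a population prior on the setpoint is shrunk toward a patient's own measurements, and the interval widens by the within-person variance. The function name, parameters, and numbers are all hypothetical.

```python
import math


def personalized_interval(prior_mean, prior_var, within_var, observations, z=1.96):
    """Normal-normal conjugate update for a patient's setpoint.

    prior_mean, prior_var: population-level prior on the setpoint (assumed known).
    within_var: within-person measurement variance (assumed known).
    Returns (posterior setpoint mean, interval low, interval high).
    """
    n = len(observations)
    if n == 0:
        post_mean, post_var = prior_mean, prior_var
    else:
        sample_mean = sum(observations) / n
        precision = 1.0 / prior_var + n / within_var
        post_mean = (prior_mean / prior_var + n * sample_mean / within_var) / precision
        post_var = 1.0 / precision
    # Predictive spread for the next measurement: setpoint uncertainty plus noise.
    half_width = z * math.sqrt(post_var + within_var)
    return post_mean, post_mean - half_width, post_mean + half_width


# Hypothetical marker values; three prior tests pull the setpoint toward 13.0.
setpoint, low, high = personalized_interval(14.0, 1.0, 0.25, [12.9, 13.1, 13.0])
print(f"setpoint={setpoint:.2f}, interval=({low:.2f}, {high:.2f})")
```

With no prior tests the interval falls back to the population prior, and each additional test narrows it, which is consistent with the abstract's point that only minimal prior data are needed.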
Jacobs, L. A.
COVID-19 risk scores developed during the pandemic relied on measurements contemporaneous with infection, leaving unresolved whether the metabolic and inflammatory vulnerability they capture pre-existed as a stable trait or was triggered by acute illness. Here, using 501,946 UK Biobank participants whose blood was drawn between 2006 and 2010, at least ten years before SARS-CoV-2 emerged, we show that baseline proteomic and metabolic profiles predict both COVID-19 hospitalization (2,783 events; C-statistic = 0.676 [0.666-0.686]) and COVID-19 mortality (1,564 deaths; C-statistic = 0.730 [0.701-0.760]) from parsimonious, regularized feature sets. The IL-1 pathway index (xIL1, +0.093) was independently selected for hospitalization but not mortality, while the IL-6 trans-signaling index (xIL6, +0.040) was selected for mortality but not hospitalization. This differential pathway weighting was corroborated by independent LightGBM/SHAP analysis and mirrors the subsequent success of tocilizumab (anti-IL-6R) and the limited efficacy of anakinra (anti-IL-1R) in reducing COVID-19 mortality in randomized trials conducted years later. The mortality model was additionally characterized by central adiposity (waist-hip ratio, +0.386), a respiratory compromise index (xRSP, +0.149), and prodromal cardiovascular disease (pCVD, +0.246). These findings establish that vulnerability to a novel pathogen is, in substantial part, a pre-existing and measurable prodromal state, with implications for pandemic preparedness and population-level risk stratification.
Su, C.-Y.; Butler-Laporte, G.
Yang et al. recently published a systematic comparison of genetic effects on disease susceptibility and disease-specific mortality across nine common diseases and seven biobanks, concluding that susceptibility and survival architectures overlap only modestly. This is an important resource, but we argue that the current mortality genome-wide association studies (GWAS) require explicit power calibration before limited overlap can be interpreted biologically. Using two-sample Mendelian randomization (MR) with positive-control exposures, we show that even a well-powered positive control, body mass index (BMI), instrumented by 855 genome-wide-significant variants, produces a clearly detectable effect for heart failure (HF) mortality, with only weaker evidence for chronic kidney disease (CKD) mortality. However, when BMI instruments were stratified into quartiles by exposure-association strength, the heart failure association remained nominally significant only in the two strongest quartiles and was not significant in the two weakest quartiles. Further, household income, used as a weakly instrumented socio-economic contrast, has insufficient power to detect moderate effects on any disease mortality outcome. These analyses indicate that current disease mortality GWAS may be insufficiently powered to detect shared effects. In contrast, the same BMI instrument set produced large and directionally coherent effects when applied to case-control GWAS of the matched six diseases, with the HF and prostate cancer associations preserved under a within-family BMI sensitivity analysis, and nominal support for CKD. The HF mortality association was also preserved in a within-family BMI sensitivity analysis. Similarly, genetically proxied household income was associated with HF risk in the case-control GWAS despite null associations with disease-specific mortality, consistent with limited power in the mortality GWAS.
These findings indicate that the limited BMI-mortality evidence across several outcomes is unlikely to reflect a weak BMI instrument or dynastic artefacts alone and instead supports limited effective power in current disease-mortality GWAS.
Shinde, S. N.; Shinde, R. S.; Bhangaaley, S. Y.
Background: Consensus continuous glucose monitoring (CGM) metrics, including time in range (TIR), time above range (TAR), time below range (TBR), mean glucose, glucose management indicator, and glycemic variability, are essential for modern glucose assessment. However, these whole-day summaries do not explicitly partition nocturnal basal from daytime ambulatory glycemic burden. Objective: To develop and evaluate a complementary domain-based CGM framework that quantifies basal and daytime ambulatory glycemic exposure across oral glucose tolerance test (OGTT)-derived dysglycemia phenotypes. Methods: In this observational, clinic-based study, 253 individuals underwent OGTT with insulin measurement and CGM. Participants were classified using a prespecified OGTT-derived phenotyping algorithm, implemented through a deterministic rules-based web calculator, and collapsed into five groups: NoDM, Increased insulin resistance, Midzone Glycemia, Prediabetes, and Diabetes. CGM files were uniformly reprocessed by selecting the latest contiguous episode and retaining the most recent 15 calendar days with data. The 24-hour profile was partitioned into nocturnal basal (00:00 to <06:00) and daytime ambulatory (06:00 to <24:00) domains. Derived indices included Area of Basal Glycemia (ABG), Area of Prandial/Daytime Ambulatory Glycemia (APG), incremental ABG (iABG), incremental APG (iAPG), and exploratory deficit indices dABG and dAPG. Results: The final dataset contributed 3,647 analyzable CGM days. APG remained higher than ABG across all groups. Mean ABG/APG increased from 80.45/86.38 mg/dL in NoDM to 111.96/124.70 mg/dL in Diabetes. Mean iABG/iAPG increased from 5.65/6.60 to 34.12/38.91 mg/dL, whereas dABG/dAPG declined as dysglycemia worsened. Conclusions: The ABG/APG framework provides interpretable, domain-resolved CGM burden metrics that separate basal from daytime ambulatory exposure and distinguish total burden from above-threshold excess. 
These indices are proposed as adjunctive metrics to support dysglycemia phenotyping, early risk recognition, and treatment monitoring, but are not intended to replace established consensus CGM metrics or diagnostic criteria. External, prospective validation is required.
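The time partition described in this abstract (nocturnal basal 00:00 to <06:00, daytime ambulatory 06:00 to <24:00) can be sketched as follows. The abstract does not give the formulas for ABG or APG, so the per-domain mean glucose computed here is only an assumed stand-in, and `domain_means` and the readings are hypothetical.

```python
from datetime import datetime


def domain_means(readings):
    """Split (timestamp, glucose mg/dL) readings at 06:00 and average each domain.

    Returns (nocturnal_mean, daytime_mean); a domain with no readings yields None.
    """
    nocturnal = [g for t, g in readings if t.hour < 6]   # 00:00 to <06:00
    daytime = [g for t, g in readings if t.hour >= 6]    # 06:00 to <24:00

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(nocturnal), mean(daytime)


# Hypothetical readings for a single calendar day.
day = [
    (datetime(2024, 1, 1, 1, 0), 80),
    (datetime(2024, 1, 1, 3, 0), 90),
    (datetime(2024, 1, 1, 5, 59), 100),
    (datetime(2024, 1, 1, 6, 0), 110),
    (datetime(2024, 1, 1, 12, 0), 130),
    (datetime(2024, 1, 1, 23, 30), 120),
]
nocturnal_mean, daytime_mean = domain_means(day)
print(nocturnal_mean, daytime_mean)
```

A reading at exactly 06:00 falls in the daytime domain, matching the half-open intervals stated in the abstract.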
Cavon, J.; Perez, C.; Quinn-Bohmann, N.; Magis, A. T.; Gibbons, S. M.
Emerging evidence links the gut microbiome to sleep quality, yet measuring sleep at scale remains challenging. Commercial wearables, such as Fitbit, capture objective sleep and activity data in naturalistic settings. We integrated Fitbit data from a large, deeply-phenotyped cohort with paired lifestyle and health questionnaires. Wearable-derived measures aligned well with self-reported sleep, activity, and happiness. We identified dozens of covariate-adjusted associations between Fitbit-derived sleep features, lifestyle factors, and multi-omic data. Among molecular feature sets, the gut microbiome showed the greatest number of associations with sleep quality: butyrate-producing genera were positively associated with sleep and amplified the benefits of physical activity. Oscillospira, in particular, was consistently associated with better sleep. In blood, insulin, omega-3, and cortisol correlated with poorer sleep, whereas lower alcohol intake and mineral supplements correlated with better sleep. These robust, covariate-adjusted findings advance mechanistic understanding of the gut-sleep axis and broader molecular and lifestyle determinants of sleep quality.
Tuttle, M.; Maas, C. C. H. M.; An, J.; Wessler, B. S.; Harvey, W. F.; Selker, H. P.; van Klaveren, D.; Kent, D. M.
The Epic Sepsis Model version 2 (ESMv2) is a prediction model embedded into the electronic medical record and used to warn clinicians which hospitalized patients are at risk for sepsis. We conducted a retrospective cohort study of 31,951 hospitalizations of 25,760 patients to compare analyses conducted at the commonly used patient level (where the maximum prediction prior to the onset of sepsis is used to measure performance) vs a novel prediction level (where each prediction is used to measure performance). Sepsis, defined by the Sepsis-3 criteria, occurred during 1,049 hospitalizations (3.3%). Patient-level analyses suggested excellent discrimination (AUC 0.86 [IQR 0.85, 0.87]), whereas prediction-level analyses demonstrated lower performance (AUC 0.62 [IQR 0.57, 0.65]). Low estimates of the positive predictive value (14.5% at the patient level vs 4% at the prediction level) imply a high number of false alerts. Common evaluation approaches may overstate the performance of dynamic prediction models and mislead clinical decision-making.
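The gap between the two evaluation levels can be reproduced on toy data. The sketch below is illustrative only: the scores are invented, and labeling every prediction with its hospitalization's eventual outcome is a simplifying assumption, since the abstract does not state the study's exact labeling rule.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical hospitalizations with repeated risk scores on a 0-100 scale.
hospitalizations = [
    {"sepsis": True, "scores": [20, 40, 90]},
    {"sepsis": True, "scores": [30, 60, 80]},
    {"sepsis": False, "scores": [30, 50, 40]},
    {"sepsis": False, "scores": [10, 20, 50]},
]

# Patient level: one number per hospitalization, the maximum score.
patient_auc = auc(
    [max(h["scores"]) for h in hospitalizations if h["sepsis"]],
    [max(h["scores"]) for h in hospitalizations if not h["sepsis"]],
)

# Prediction level: every individual score is evaluated.
prediction_auc = auc(
    [s for h in hospitalizations if h["sepsis"] for s in h["scores"]],
    [s for h in hospitalizations if not h["sepsis"] for s in h["scores"]],
)

print(f"patient-level AUC:    {patient_auc:.3f}")
print(f"prediction-level AUC: {prediction_auc:.3f}")
```

Taking the maximum discards the many low scores that septic patients receive before their risk rises, which is why the patient-level figure looks better than what a clinician sees alert by alert.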
Alleman, T. W.; Van Wesemael, T.; Shanker, N.; Mietchen, M. S.; Loo, S.; Ajagbe, S. O.; Baetens, J. M.; Lemaitre, J.; Hill, A. L.; Truelove, S. A.; Bento, A. I.
Hybrid mechanistic-statistical models offer interpretability and adaptability for short-term seasonal epidemic forecasting, but it remains unclear whether their accuracy depends more on increased biological complexity or on the assimilation of richer data. Using eight retrospective influenza seasons in North Carolina, we evaluate whether training on historical data and assimilating auxiliary emergency department (ED) visit data improves four-week-ahead hospital admission forecasts more than adding biological complexity (multi-subtype structure and cross-season immunity). Hierarchical Bayesian training on historical data improves accuracy by 22.4% (95% CI: 16.4-28.1%), and inclusion of ED visit data yields a further 5.3% (95% CI: 3.0-7.6%) improvement, whereas added biological complexity produces diminishing or null gains. We further observe a substitution effect in which ED visit data partially compensates for omitted biological structure. We deployed a simplified model variant in the 2025-2026 CDC FluSight Challenge and ranked among the top ensemble performers, supporting the robustness of Bayesian hierarchical training in real time. Together, these findings indicate that short-term forecast accuracy is driven more by historical learning and assimilating auxiliary signals than by biological fidelity, with implications for how forecasting systems should balance mechanistic complexity.
Jiang, H.; Wang, X.; Vanky, E.; Parreira, D.; Derisoud, E.; Jannig, P. R.; Nordenhok, E.; Zhao, A.; Li, C.; Stridsklev, S.; Holzmann, M.; Li, X.; Luthander, C. M.; Stener-Victorin, E.; Deng, Q.
Polycystic ovary syndrome (PCOS) is linked to adverse pregnancy outcomes and increased cardiometabolic risk in offspring, yet the placental mechanisms underlying these risks remain poorly understood. Metformin is prescribed during PCOS pregnancies despite limited mechanistic justification. Using multi-modal molecular analyses of placentas from healthy controls and women with PCOS randomized to placebo or metformin (PregMet trial), restricted to uncomplicated pregnancies, we characterized direct PCOS-associated placental alterations independent of confounding complications. PCOS placentas showed transcriptional downregulation across multiple cell types and shifts in cell type proportions. Specifically, syncytiotrophoblasts exhibited reduced expression activity of growth hormone receptor signaling and glycosaminoglycan biosynthesis. Endothelial cells displayed diminished receptor tyrosine kinase pathway activity, including VEGFC, despite increased cell proportion and hypervascularity. Intercellular communication networks were globally suppressed, including reductions in PDGF signaling from Hofbauer cells to fibroblasts. Notably, metformin did not reverse most PCOS-associated molecular alterations and induced transcriptional changes correlated with birth weight and childhood BMI. These findings indicate that PCOS-associated placental features are driven by cell-type-specific dysregulation of growth factor and angiogenic signaling pathways that are largely unresponsive to metformin. This underscores the need to develop mechanism-based, placenta-targeted therapeutic alternatives for future pregnancy management.
Totsune, E.; Nakajima, D.; Konno, R.; Mikami-Saito, Y.; Arai-Ichinoi, N.; Nishida, H.; Yagi, H.; Ishige, T.; Suzuki, H.; Shirota, M.; Takayama, J.; Takano-Asai, C.; Shimura, M.; Sasai, H.; Lee, T.; Kido, J.; Nakajima, Y.; Kobayashi, H.; Kikuchi, A.; Numakura, C.; Hamazaki, T.; Oishi, K.; Nakamura, K.; Kawashima, Y.; Ohara, O.; Wada, Y.
Background: Citrin deficiency, caused by biallelic pathogenic variants in SLC25A13, must be identified early to prevent serious complications such as hyperammonemia and liver failure. However, clinical diagnosis is often delayed due to its nonspecific presentation and the limited sensitivity of amino acid-based newborn screening methods. Although genome-based evaluations are being investigated to address these issues, concerns about their cost, turnaround time, variant interpretation ability, and data handling highlight the need for a more practical yet reliable alternative. We investigated the feasibility of applying a proteomic approach to dried blood spots (DBS), which are routinely used in newborn screening. Methods: We performed untargeted liquid chromatography-tandem mass spectrometry to analyze the proteome of DBS using a previously developed "non-targeted analysis of non-specifically DBS-absorbed proteins" (NANDA) workflow. SLC25A13 protein abundance was quantified in individuals with biallelic loss-of-function mutations, compound loss-of-function/missense mutations, and heterozygous carriers; this was also evaluated in healthy and diseased controls representing relevant differential diagnoses. To leverage proteomic information, we derived a multivariate proteomic signature using feature selection and evaluated its performance with leave-one-out cross-validation. Biological relevance was assessed by enrichment analysis, and complementary transcriptomics was performed using RNA sequencing. Results: A total of 7,474 proteins, including SLC25A13, were consistently detected in DBS. SLC25A13 was undetectable in individuals with biallelic loss-of-function mutations. However, individuals with compound loss-of-function/missense genotypes showed reduced but measurable SLC25A13 levels, comparable to those observed in heterozygous carriers.
In contrast, a compact 15-protein signature accurately identified individuals with compound loss-of-function/missense genotypes (AUC, 0.99; sensitivity, 1.00; specificity, 0.95). The signature was enriched for Ca2+ response, and transcriptomics showed downregulation of genes related to multimodal ion channels in affected individuals compared to controls. Conclusions: DBS-based proteomic profiling may assist in the diagnosis of citrin deficiency through SLC25A13 quantification and a biologically plausible multivariate signature. More broadly, this strategy offers a promising new diagnostic layer for protein disorders, providing a proteomic readout in a clinically practical DBS format with potential utility for future diagnostic and screening applications.
Raghavan, S.; Liu, W. G.; Ho, M. R.; Warsavage, T.; Ghosh, D.; Caplan, L.; Reusch, J. E.
Objectives: Diabetes affects over 500 million people globally and glycemia is inadequately managed. Metformin is the most frequently prescribed initial treatment for type 2 diabetes globally, yet glycemic response trajectories to metformin in routine real-world care and predictors of treatment response have not been well described. We aimed to identify glycemic response trajectories in adults prescribed metformin monotherapy as initial type 2 diabetes treatment and predictors of poor glycemic response to metformin. Design: Observational cohort study using latent class mixed models to identify hemoglobin A1c (HbA1c) trajectory classes, followed by random forest machine learning to predict trajectory class membership. Setting: US Veterans Affairs Healthcare System. Participants: Adults treated with metformin alone for >30 days after diabetes diagnosis with a minimum of two HbA1c measurements from 90 days prior to two years after the first metformin prescription (N=140,413). Exposures: Demographic, laboratory, vital sign, and comorbidity data were included as predictors of metformin response trajectory. Main Outcomes and Measures: We included all HbA1c measurements (487,604 total) for two years after metformin initiation to define metformin glycemic response trajectories. Results: We identified three HbA1c trajectories: stably low (89.7% of sample, mean HbA1c decrease from 7.2% to 6.6%), brisk response (7.1% of sample, mean HbA1c decrease from 11.4% to 7.0%), and non-response (3.1% of sample, mean HbA1c increase from 8.9% to 10.8%). Of those in the stably low and brisk response classes at 2 years, 91% maintained HbA1c at approximately 7% on metformin alone for 5 years after drug initiation. Prediction models could accurately predict brisk response (91% accuracy) but not metformin non-response (59% accuracy). Conclusions: Most individuals treated initially with metformin monotherapy have a beneficial and durable glycemic response.
Predicting individuals who will not respond to metformin may be challenging but is evident within six months with recommended glycemic surveillance. The findings support current guidelines for HbA1c surveillance when initiating diabetes treatment.
Escalera, M.; Lopez Ortiz, E.; Garcia Morales, C.; Cruz-Bonilla, E.; Guerrero Flores, S.; Weaver, S.; Matias Florentino, M.; Tapia Trejo, D.; Davila Conn, V.; Roberto Cardenas Porras; Eduardo Zarza Sanchez; Silvia del Arenal Sanchez; Jorge A Gutierrez Soto; Karina Nava Memije; Jessica Monreal Flores; Alejandro Guzman; Rebecca E Garcia Mendiola; Patricia Iracheta; Veronica Ruiz Gonzalez; Veronica Quiroz Morales; Israel Macias Gonzalez; Manuel A Becerril Rodriguez; Raul A Cruz Flores; Andrea Gonzalez Rodriguez; Dulce M Lopez Sanchez; Miroslava Card
Understanding HIV transmission in densely populated urban settings is essential to mitigate ongoing epidemic spread. We present a comprehensive analysis of recent HIV transmission dynamics in Greater Mexico City, one of the world's largest metropolitan areas, comprising Mexico City and neighbouring municipalities of the State of Mexico. Drawing from over 7,000 complete pol gene sequences representing around 50% of new cases reported between 2019 and 2022 within the study region, we reconstructed the transmission network based on pairwise genetic distance. We identified ten large transmission clusters exhibiting sustained growth up to the most recent sampling period. We further analysed paired genetic and high-resolution human mobility data using an integrated phylogeographic approach. We observed a heterogeneous pattern of viral spread across the region, supported by extensive mixing at a wider geographic scale. Across Greater Mexico City, which has a high population density, HIV transmission is minimally spatially constrained, a pattern likely fuelled by intense human mobility. Thus, population movement weakens isolation by distance in large urban areas even for a chronic infection that is sexually and vertically transmitted. We demonstrate the value of integrating large-scale genetic, epidemiological, and mobility data to resolve contemporary HIV transmission dynamics in densely populated urban settings.
Liu, T.; Zeng, X.; Snitz, B. E.; Karikari, T. K.; Deek, R. A.
Blood biomarker models are increasingly used in Alzheimer's disease and related dementia translational research, but predictive performance can be inflated when the same dataset is used for both model development and evaluation. We assess the effect of data double dipping using simulations and NULISA proteomic data from the MYHAT-NI community-based cohort to predict brain amyloid-beta neuroimaging status. In both settings, training AUC increased as more biomarkers were added, while testing AUC peaked earlier and then declined. These findings show that data double dipping can inflate model performance and highlight the need for external validation or internal validation with data partitioning.
Berger, C. G.; Puttfarcken, B.; Qiu, J.; Hauer, I.; Herr, S.; Juestel, D.; Pleitez, M. A.
We present a compact pump-and-probe mid-infrared Optothermal Spectrometer (OTHES) equipped with Spatial Probing and Autocorrection (SPAC) optimized for robust intravital application in humans. SPAC-OTHES facilitates alignment stability and spectral comparability across different measurement sessions involving different skin types. In contrast to the state of the art, SPAC-OTHES uses camera-based beam detection and an auto-calibration mechanism that enables approximately 73% better spectral reproducibility in intravital measurements in human volunteers than non-calibrated readouts. Moreover, SPAC-OTHES has the potential to lower the glucose quantification error, as demonstrated here in artificial skin phantoms, where an improvement of 52% compared to conventional diode-based detection was observed. The compactness of OTHES, combined with reliable SPAC readout, has the potential to accelerate commercialization and broad application of biosensors based on mid-infrared spectroscopy.
Mao, Y.; Lopman, B.; Koelle, K.; Lau, M. S.
Accurate forecasting of seasonal influenza is critical for public health preparedness, and data-driven models are central to this effort. However, most approaches rely on aggregate indicators of influenza-like-illness (ILI), which can obscure heterogeneity and limit predictability at longer horizons. While subtype dynamics are well established, their role in data-driven forecasting remains incompletely understood. Here, we integrate subtype-resolved surveillance data into diverse data-driven frameworks using over a decade of U.S. surveillance records to evaluate and decompose predictive signal in influenza forecasting. Across pre- and post-COVID-19 periods, subtype-informed models consistently improve over baseline models trained on aggregate ILI alone, with the largest gains at longer horizons. Decomposition reveals a horizon-dependent reorganization of predictability: autoregressive persistence in recent aggregate incidence dominates at short horizons but declines with lead time, while predictive signal shifts toward subtype-derived structure. Within this structure, interaction-related features among co-circulating subtypes grow systematically with forecast horizon, indicating that longer-term predictability is driven increasingly by interaction structure rather than marginal subtype composition alone. Together, our results show that subtype information provides non-redundant predictive signal and extends the effective forecasting window of data-driven models. More broadly, our findings suggest that aggregation of heterogeneous subtype processes can obscure latent predictability, supporting subtype-resolved surveillance.
Sharma, R.; Hu, F.; Li, X.; Campos, R.; Kundu, K.; Atanur, S.; Karpinski, M.; Wasilewski, S.; MacArthur, S.; Vitsios, D.; Dhindsa, R. S.; Georgakopoulos-Soares, I.; Burren, O. S.; Petrovski, S.; Mustoe, A. M.; Wang, Q.; Glodzik, D.; Zou, X. Z.
Non-coding variants are important contributors to human traits and diseases but linking them to molecular mechanisms and phenotypes at scale remains challenging. G-quadruplexes (G4s) are four-stranded structures formed by guanine-rich sequences and have emerged as key functional elements within the non-coding genome. G4s are enriched in regulatory regions and can modulate gene expression at both the DNA and RNA levels, influencing transcription, replication, and RNA processing, positioning them as key mediators linking non-coding variation to complex biological traits. Here, we profile putative G4s across five regulatory regions in 459,449 UK Biobank genomes and perform phenome-wide association analyses spanning 2,941 plasma protein abundances, 13,321 binary traits, and 1,682 quantitative traits. We show that putative G4-modifying variants are depleted under purifying selection despite elevated local mutability and drive large, bidirectional associations with plasma proteins and clinical traits, including associations not captured by coding variants. Using a mechanism-aware collapsing strategy that groups rare non-coding variants by their predicted impact on G4 stability, we achieved stronger gene-level signals than those obtained with standard rare-variant collapsing approaches. Integrating non-coding and protein-truncating variants (PTVs) increases discovery power, revealing 843 significant associations missed by the PTV-only model. Replication in the Alliance for Genomic Discovery cohort demonstrates cross-cohort robustness. Our study suggests G4s as widespread mediators of non-coding regulation and provides a framework for mechanism-informed target discovery and prioritization across the non-coding genome.
Heilman, A. M.; Warsavage, T.; Liu, W. G.; Wilson, P. W.; Phillips, L. S.; Reusch, J. E.; Raghavan, S.
Importance: Despite the benefits of statin therapy in individuals with diabetes, fewer than 70% of adults with diabetes meet contemporary guidelines for statin therapy and reducing low-density lipoprotein cholesterol (LDL) to <100 mg/dL. Evidence describing delays in statin initiation after diabetes diagnosis and associated clinical outcomes may motivate process-of-care interventions to improve guideline-recommended care in individuals newly diagnosed with type 2 diabetes mellitus (T2D). Objective: To examine the timing of statin initiation and achievement of LDL <100 mg/dL after diabetes diagnosis, and to determine the association of early LDL reduction among statin initiators with incident atherosclerotic cardiovascular disease (ASCVD). Design: Retrospective observational cohort study using data from 2005-2021. Setting: Veterans Affairs Health Care System (VA). Participants: Individuals with newly diagnosed T2D. Exposure: Primary exposure was ASCVD risk based on ACC/AHA Pooled Cohort Equations; secondary exposure was LDL <100 mg/dL in the first year after T2D diagnosis among statin initiators. Main Outcomes and Measures: Co-primary outcomes were initiation of statin therapy and achievement of LDL <100 mg/dL within 5 years of diabetes diagnosis; incident 5-year ASCVD was a secondary outcome. Results: Among 100,406 individuals with newly diagnosed T2D, 59,615 (59.4%) were prescribed statin therapy within 5 years, and 44,783 (57.5%) of those with LDL above goal achieved LDL <100 mg/dL within 5 years. Relative to those at low (<7.5%) 10-year ASCVD risk, individuals at intermediate (7.5-20%) and high (>20%) risk were more likely to be initiated on a statin (intermediate: Hazard Ratio [HR] 1.14 [95% CI 1.11, 1.17]; high: HR 1.16 [95% CI 1.13, 1.19]) and to achieve LDL <100 mg/dL (intermediate: HR 1.23 [95% CI 1.19, 1.26]; high: HR 1.34 [95% CI 1.30, 1.38]). 
Among those prescribed a statin within one year of diabetes diagnosis, achieving LDL <100 mg/dL in the first year after diabetes diagnosis was associated with lower risk of 5-year incident ASCVD (HR 0.84 [95% CI 0.77, 0.92]). Conclusions and Relevance: Gaps in guideline-directed primary prevention of ASCVD arise early following initial diabetes diagnosis. Guideline-recommended early LDL lowering among statin initiators was associated with improved clinical outcomes.
von Itter, M.-N.; Grune, E.; Nonnenmacher, T.; Rach, S.; Flis, M.; Haueise, T.; Weiss, J.; Brenner, H.; Keil, T.; Roden, M.; Schulze, M. B.; Schulz-Menger, J. E.; Völzke, H.; Stefan, N.; Schlett, C. L.; Kauczor, H.-U.; Machann, J.; Bamberg, F.; Nattenmüller, J.; Norajitra, T.; Rospleszcz, S.
Background and Aims: Steatotic liver disease (SLD) has high clinical and public health relevance. Robust population estimates of SLD and its subcategories are challenging due to the limitations of ultrasound measurements or non-invasive scores, particularly for low-grade steatosis. We aimed to quantify SLD prevalence using magnetic resonance imaging (MRI) in the population-based German National Cohort (NAKO). Methods: Hepatic multi-echo Dixon MRI was performed at 5 dedicated study sites with identical setup across Germany. Liver fat (proton density fat fraction, PDFF), R2* as proxy for liver iron, and liver volume were assessed. The resulting data of N = 29,842 individuals (age range 20-72 years) were weighted by survey weights for regional representativeness, resulting in a sample of 50% women and a mean age of 45.6 years. SLD was defined as PDFF ≥ 5.75%, and sex-specific prevalence according to age, BMI, socioeconomic status and geographic region was calculated. Results: Overall, SLD prevalence was 21.3% in women and 35.7% in men, and the majority were metabolic dysfunction-associated (MASLD, 89.3% of all SLD cases). Prevalence increased with age in a sex-specific pattern, suggesting potential menopausal effects in women. There was a relevant prevalence of SLD in individuals with normal weight (5.3% in women, 13.2% in men) and the age group <25 years (7.5% in women, 11.9% in men). Differences in prevalence between low and high socioeconomic status were more pronounced in women (37% vs 15.8%) compared to men (45.5% vs 30.3%). Conclusions: Data underscore the high public health relevance of SLD and its subcategory MASLD. The considerable prevalence in groups historically considered low-risk, such as younger or lean individuals, emphasizes the need for raising awareness early.